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Abstract 

We present new algorithms for simulation optimization using random directions stochastic approxi¬ 
mation (RDSA). These include first-order (gradient) as well as second-order (Newton) schemes. We in¬ 
corporate both continuous-valued as well as discrete-valued perturbations into both our algorithms. The 
former are chosen to be independent and identically distributed (i.i.d.) symmetric uniformly distributed 
random variables (r.v.), while the latter are i.i.d., asymmetric Bernoulli r.v.s. Our Newton algorithm, 
with a novel Hessian estimation scheme, requires A<-dimensional perturbations and three loss measure¬ 
ments per iteration, whereas the simultaneous perturbation Newton search algorithm of 0 requires 
2 TV-dimensional perturbations and four loss measurements per iteration. We prove the unbiasedness 
of both gradient and Hessian estimates and asymptotic (strong) convergence for both first-order and 
second-order schemes. We also provide asymptotic normality results, which in particular establish that 
the asymmetric Bernoulli variant of Newton RDSA method is better than 2SPSA of 0. Numerical 
experiments are used to validate the theoretical results. 


1 Introduction 


Problems of optimization under uncertainty arise in many areas of engineering and science, such as signal 
processing, operations research, computer networks, and manufacturing systems. The problems themselves 
may involve system identification, model fitting, optimal control, or performance evaluation based on ob¬ 
served data. We consider the following optimization problem in N dimensions: 


Find x* = are; min fix). 

x<ZR N V J 


( 1 ) 


We operate in a simulation optimization setting 114], i.e., we are given only noise-corrupted measurements 
of /(•) and the goal is to devise an iterative scheme that is robust to noise. We assume that, at any time 
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instant n, the conditional expectation of the noise, say £ n , given all the randomness up to n is zero (see 
assumption (A2) in Section [272] for the precise requirement). 

A general stochastic approximation algorithm [260 finds the zeros of a given objective function when 
only noisy estimates are available. Such a scheme can also be applied to gradient search provided one 
has access to estimates of the objective function gradient. Schemes based on sample gradients include 
perturbation analysis (PA) [Ijl] and likelihood ratio (LR) [23] methods, which typically require only one 
simulation run per gradient estimate, but are not universally applicable. 

Algorithms based on gradient search perform the following iterative update: 


S'Tl+l — Q n \7 f(Xj i), 


( 2 ) 


where a n are step-sizes that are set in advance and satisfy standard stochastic approximation conditions (see 
(A2) in Section [2]) and V/(-) is an estimate of the gradient V/(-) of the objective function /. 

The finite difference Kiefer-Wolfowitz (KW) 112011 estimates require 2 N system simulations for a gra¬ 
dient estimate V/. This makes the scheme, also called finite difference stochastic approximation (FDSA), 
disadvantageous for large parameter dimensions. The random directions stochastic approximation (RDSA) 


approach [21, pp. 58-60] alleviates this problem by requiring two system simulations regardless of the 


parameter dimension. It does this by randomly perturbing all the N parameter component directions us¬ 
ing independent random vectors that are uniformly distributed over the surface of the ./V-dimensional unit 
sphere. It has been observed in [0], see also [27], 0], 0], that RDSA also works in the case when the 
component perturbations are independent Gaussian or Cauchy distributed. 


RDSA with Gaussian perturbations has also been independently derived in [19] by approximating the 
gradient of the expected performance by its convolution with a multivariate Gaussian that is then seen (via 
an integration-by-parts argument) as a convolution of the objective function with a scaled Gaussian. This 
procedure requires only one simulation (regardless of the parameter dimension) with a perturbed parameter 
(vector) whose component directions are perturbed using independent standard Gaussian random variables. 
A two-simulation finite difference version that has lower bias than the aforementioned RDSA scheme is 
studied in [d3tl, 0,0. 

Among all gradient-based random perturbation approaches involving sample measurements of the ob¬ 
jective function, the simultaneous perturbation stochastic approximation (SPSA) of [290 has been the most 
popular and widely studied in applications, largely due to its ease of implementation and observed numer¬ 
ical performance when compared with other approaches. Here each component direction of the parameter 
is perturbed using independent, zero mean, symmetric Bernoulli distributed random variables. In 0, per¬ 
formance comparisons between RDSA, SPSA, and FDSA have been studied through both analytical and 
numerical comparisons of the mean square error metric, and it is observed that SPSA outperforms both 
RDSA with Gaussian perturbations and FDSA. 

Within the class of simulation-based search methods, there are also methods that estimate the Hessian 
in addition to the gradient. Such methods perform the following update: 


Xn+ 1 — 2-71 &n(H n ) ^ f {%n) 


(3) 


where V/(x n ) is an estimate of the gradient V/(-) as before, while H n is an N x ./V-dimensional matrix 
estimating the true Hessian V 2 /(x*). Thus, ([3]) can be seen to be the stochastic version of the well-known 
Newton method for optimization. 

Stochastic Newton methods are often more accurate than simple gradient search schemes, which are 
sensitive to the choice of the constant ao in the canonical step-size, a n = an/ n. The optimal (asymptotic) 
convergence rate is obtained only if ao > 1/3Ao, where Ao is the minimum eigenvalue of the Hessian of the 
objective function (see [11]). However, this dependency is problematic, as Ao is unknown in a simulation 
optimization setting. Hessian-based methods get rid of this dependency, while attaining the optimal rate 
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(one can set ao = 1)- An alternative approach to achieve the same effect is to employ Polyak-Ruppert 
averaging, which uses larger step-sizes and averages the iterates. However, iterate averaging is optimal only 
in an asymptotic sense. Finite-sample analysis (see Theorem 2.4 in 112]) shows that the initial error (that 
depends on the starting point xq of the algorithm) is not forgotten sub-exponentially fast, but at the rate 1/n 
where n is the number of iterations. Thus, the effect of averaging kicks in only after enough iterations have 
passed and the bulk of the iterates are centered around the optimum. 

In [llOll . the Hessian is estimated using 0(N 2 ) samples of the cost objective at each iterate, while in 
[28] the Hessian is estimated assuming knowledge of objective function gradients. During the course of 
the last fifteen years, there has been considerable research activity aimed at developing adaptive Newton- 
based random search algorithms for stochastic optimization. In [3(1]. the first adaptive Newton algorithm 
using the simultaneous perturbation method was proposed. The latter algorithm involves the generation 
of 2N independent symmetric Bernoulli distributed random variables at each update epoch. The Hessian 
estimator in this algorithm requires four parallel simulations with different perturbed parameters at each 
update epoch. Two of these simulations arc also used for gradient estimation. The Hessian estimator is 
projected to the space of positive definite and symmetric matrices at each iterate for the algorithm to progress 
along a descent direction. In |Q]], three other simultaneous perturbation estimators of the Hessian that require 
three, two, and one simulation(s) have been proposed in the context of long-run average cost objectives. The 
resulting algorithms incorporate two-timescale stochastic approximation, see Chapter 6 of [5 k Certain three- 
simulation balanced simultaneous perturbation Hessian estimates have been proposed in [30. In addition, 
certain Hessian inversion procedures that require lower computational effort have also been proposed, see 
also 0]. In 134], a similar algorithm as in [30] is considered except that for computational simplicity, the 
geometric mean of the eigenvalues (projected to the positive half line) is used in place of the Hessian inverse 
in the parameter update step. In 0211 . certain enhancements to the four-simulation Hessian estimates of 
[30] using some feedback and weighting mechanisms have been proposed. In Q], Newton-based smoothed 
functional algorithms based on Gaussian perturbations have been proposed. An overview of random search 
approaches (both gradient and Newton-based) involving both theory and application of these techniques is 
available in [4J]. 

A related body of work in the machine learning community is bandit optimization [17]. Gradient esti¬ 
mates using the principle of first-order RDSA schemes have been used in [[130. In 125], the authors explore 
smoothing using Gaussian perturbations for stochastic convex optimization problems and establish optimal 
convergence rates for this setting. In @], the authors establish optimal (non-asymptotic) convergence rates 
for first-order RDSA gradient estimate coupled with a mirror descent scheme. 

The principal aim of this paper is to develop an RDSA-based second-order method. A related objective 
is to achieve convergence guarantees similar to 2SPSA in |t3ci], preferably at a lower per-iteration cost. A 
key ingredient in any RDSA scheme is the choice of random perturbations. We propose two schemes in this 
regard. The first scheme employs i.i.d. uniform [— 77 , rj\ (for some rj > 0) perturbations, while the second 
scheme employs i.i.d. asymmetric Bernoulli perturbations. The latter takes values —1 and 1 + e (for some 


small e > 0 ) with probabilities 


(l±i 

V2+e 


and ( 


), respectively. 

As evident from the update rule ([3]), the performance of any second-order method is affected by the 
choices of both the gradient estimation scheme and the Hessian estimation scheme. This motivates our 
study of first-order RDSA schemes with the aforementioned two perturbation schemes prior to analysing 
the second-order RDSA algorithms. Table [j] presents a classification of our algorithms based on their - order 
and the perturbations employed. 

We summarize our contributions below. 


Stochastic Newton method We propose an adaptive Newton-based RDSA algorithm. The benefits of 
using our procedure are two-fold. First, the algorithm requires generating only N perturbation variates at 
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Table 1: A taxonomy of proposed algorithms. 


Order —>- 

Perturbations 

First-order 

Second-order 

i 



Uniform [— 77 , 77 ] 

lRDSA-Unif 

2RDSA-Unif 

Asymmetric Bernoulli { — 1,1 + e} 

lRDSA-AsymBer 

2RDSA-AsymBer 





each iteration even for the Newton scheme (N being the parameter dimension), unlike the simultaneous 
perturbation Newton algorithms of 0 ], H, 0 , jin , 1340, which require generation of 2 N perturbation 
variates. Second, the number of system simulations required per iteration in our procedure is three, whereas 
the procedure in |f3()fl requires four. 

Stochastic gradient method We also propose a gradient RDSA scheme that can be used with either 
uniform or asymmetric Bernoulli perturbations. Our gradient estimates require two parallel simulations 
with (randomly) perturbed parameters. 


Convergence and rates We prove unbiasedness of our gradient and Hessian estimators and prove almost 
sure convergence of our algorithms to local minima of the objective function. We also present asymptotic 
normality results that help us analyse our algorithms from an asymptotic mean square error (AMSE) view¬ 
point. From this analysis, we observe that 

(i) On all problem instances, the asymmetric Bernoulli variant of 2RDSA results in an AMSE that is better 
than 2SPSA, for the same number of iterations. It is computationally advantageous to employ our adaptive 
RDSA scheme as it requires three simulations per iteration (2SPSA requires four per iteration). 

(ii) On many problem instances, the adaptive RDSA scheme with uniform perturbations that we propose 
results in an AMSE that is better than the Newton algorithm (2SPSA) of [30]. In particular, the question 
of whether 2SPSA or our adaptive RDSA scheme is better depends on the values of problem-dependent 
(equivalently, algorithm-independent) quantities (see (A) and (B) in Section [3731) . and there are many prob¬ 
lem instances where 2RDSA with uniform perturbations is indeed better than 2SPSA. 

(iii) The gradient algorithm 1RDSA with asymmetric Bernoulli perturbations exhibits an AMSE that is 
nearly comparable to that of SPSA, while 1RDSA variants with Gaussian [21,3] and uniform perturbations 
perform worse. 


Experiments Numerical results using two objective functions - one quadratic and the other fourth-order - 
show that 

(i) asymmetric Bernoulli variant of 1RDSA performs on par with 1SPSA of lE^l : and 

(ii) our Newton algorithm 2RDSA-AsymBer provides better accuracy levels than the Newton algorithm 
2SPS A in [ 30] for the same number of iterations, despite 2RDS A requiring only 75% of the cost per-iteration 
as compared to 2SPSA. 

The rest of the paper is organized as follows: In Section [2], we describe the first-order RDSA algorithm 
with the accompanying asymptotic theory. In Section[3j we present the second-order RDSA algorithm along 
with proofs of convergence and asymptotic rate results. We present the results from numerical experiments 
in Section[4]and provide concluding remarks in Section[5] 
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2 First-order random directions SA (1RDSA) 
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Figure 1: Overall flow of 1-RDSA algorithm. 


Recall that a first-order gradient search scheme for solving ([Hi has the following form: 

x n+ i = x n - a n X7f(x n ), (4) 

where V/(x n ) is an estimate of V/(.r n ). As illustrated in Fig. HI the idea behind an RDSA scheme is to 
obtain noisy measurements of / at parameter values x n + S n d n and x n — 5 n d n . Denote these respective 
values by and y~, i.e., 

Vn = f( x n + Sndn) + £n , lIn = f( x n ~ S n d n ) + C- 

In the above, the noise tuple > 0} is a martingale difference sequence, the sequence of the 

perturbation constants {5 n ,n > 0} is a positive and asymptotically vanishing sequence and the random 
perturbations d n = ..., d^) J are such that {d l n , i = 1,..., N, n = 1, 2,...} are i.i.d. and independent 

of the noise sequence. These quantities are assumed to satisfy the conditions in (A2)-(A5) in Section [3721 
below. 

In the next section, we specify two different choices for d n for obtaining the gradient estimate using the 
noisy function measurements and y~, respectively. The first choice uses (continuous-valued) uniform 
random variables, while the second is based on (discrete-valued) asymmetric Bernoulli random variates. 


2.1 Gradient estimate 


Uniform perturbations 


Choose d l n , i = to be i.i.d. t/[— 77, 77 ] for some rj > 0, where U[—rj,r]] denotes the uniform 

distribution on the interval [— 77 , rf\. The RDSA estimate of the gradient is given by 


V/(s n ) 



Vn Vn 

2 S n 


(5) 


Asymmetric Bernoulli perturbations 

Choose d\, i = 1,..., N, i.i.d. as follows: 


dl 



< 

1 + e 


w.p. 

w.p. 


(1 + 0 

(2 + 6 )’ 

(2 + 0 ’ 


( 6 ) 
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where e > 0 is a constant that can be chosen to be arbitrarily small. Note that Edt = 0, E(dL r = 1 + e 

a n d£(*) 4 = fi - +e)(1 + <1 + £)3) 


(2 + e) 

Then, the RDSA estimate of the gradient is given by 


V/(x n ) = 


1 


1 + e 


- <I 7 . 


Vn ~ Vn 
2 8 n 


(V) 


For notational simplicity, we use V/(x n ) to denote the gradient estimate for both uniform and asym¬ 
metric Bernoulli distributions, where the underlying perturbations should be clear from the context. 


Motivation for the gradient estimates 


Lenmia[Tjbelow establishes that the gradient estimates in © and ([7]) are biased by a term of order 0(5%), 
and this bias vanishes since 5 n —> 0 (see (A5) below). The proof uses suitable Taylor’s series expansions 
(as in [ji^]) to obtain the following for both uniform and asymmetric Bernoulli perturbations: 

f(x n ± S n d n ) = f(x n ) ± 6 n d T n Vf(x n ) + y d J n V 2 f(x n )d n + 0(6^). 

Flence, as is shown in the proof of Lemma [T] 


E 



f(%n T ^ ndtn) f(x r , 

2 S n 




E[dnd T n ] Vf(x n ) + 0(8 2 n ), 


where T n = o(x rn , rn < n) denotes the underlying sigma-field. For the case of uniform perturbations, it is 
easy to see that E [d n df] = and since we have a scaling factor of -f in ©, the correctness of gradient 
estimate follows as 5 n —> 0. A similar argument holds for the case of asymmetric Bernoulli perturbations. 


Remark 1. (Why uniform/asymmetric Bernoulli perturbations?) Previous studies, see [21 Section 2.3.5] 
and 161), assumed the perturbation vector d n to be uniformly distributed over the surface of the unit sphere, 
whereas we consider alternative uniform/asymmetric Bernoulli perturbations for the following reasons: 


Sample efficiency: Let fiRDSA-Unif, ftRDSA-AsymBer and fiRosA-Gaussian denote the number of function mea¬ 
surements required to achieve a given accuracy using uniform, asymmetric Bernoulli and Gaussian 
distributed perturbations in 1RDSA, respectively. Further, let hspsA denote a similar number for the 
regular SPSA scheme with symmetric Bernoulli perturbations. Then, as discussed in detail in Section 
P we have the following ratio: 


ftRDSA-Unif '■ flRDSA-AsymBer ■ nRDSA-Gaussian '■ ^SPSA — 1-8 : (7 + e) : 3 : 1. 

Notice that RDSA with Gaussian perturbations requires many more measurements (3 times) than 
regular SPSA. Uniform perturbations bring down this ratio, but they are still significantly sub-optimal 
in comparison to SPSA. On the other hand, asymmetric Bernoulli perturbations exhibit the best ratio 
that can be made arbitrarily close to 1, by choosing the distribution parameter e to be a very small 
positive constant. 


Computation: Generating perturbations uniformly distributed over the surface of the unit sphere involves 
simulating N Gaussian random variables, followed by normalization / \24]. In comparison, uni¬ 
form/asymmetric Bernoulli perturbations are easier to generate and do not involve normalization. 


Remark 2. (Convex Optimization) In /@/, an RDSA-based gradient estimate has been successfully em¬ 
ployed for stochastic convex optimization, where the optimal 0(n~ 1//2 ) rate can be obtained. As in /6/. the 
authors in impose the condition that E [d n df] = I, where I is the identity matrix and suggest Gaussian 
random variables as one possibility. It is easy to see that uniform and asymmetric Bernoulli perturbations 
work well in the stochastic convex optimization setting as well. 
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2.2 Main results 

Recall that T n = er(x m , m <n) denotes the underlying sigma-field. We make the following assumption^]: 

(Al) / : —> R is three-times continuously differentiabl^ with |Vf li 2 j 3 /(x)| < cto < oo, for 

it,* 2,*3 = 1 ,N and for all x G M. N . 


(A2) 

(A3) 

(A4) 

(A5) 


{£n > £n i n = !. 2 > • • •} satisf y E [£n - £n I d n, -Fn] = 0. 

For some op, « 2 , C > 0 and for all n, E |£^| 2+< ’ < a\, E | f(x n ± S n d n ) J 24 '*’ < ct 2 . 
{d 2 n , i = 1,..., N, n = 1,2,...} are i.i.d. and independent of T n . 

The step-sizes a n and perturbation constants 5 n are positive, for all n and satisfy 


a n , 5 n —» 0 as n —> 00 , 


a„ = 00 


and 


< 00 . 


(A 6 ) sup n ||at n || < 00 w.p. 1 . 

The above assumptions are standard in the analysis of simultaneous perturbation methods, cf. 0]. In 
particular', 


• (Al) is required to ensure the underlying ODE is well-posed and also for establishing the asymptotic 
unbiasedness of the RDSA-based gradient estimates. A similar assumption is required for regular 
SPSA as well (see Lemma 1 in ll29h ). 

• (A2) requires that the noise £+, £“ is a martingale difference for all n, while the second moment 
bounds in (A3) are necessary to ensure that the effect of noise can be ignored in the (asymptotic) 
analysis of the 1RDSA recursion ©. 

• (A4) is crucial in establishing that the gradient estimates in © and © are unbiased in an asymptotic 
sense, because one obtains terms of the form E(d n £n I -E n ) after separating the function value fix,, ± 
5 n d n ) and the noise ffy in ©/©. The independence requirement in (A4) ensures that E(d n (£+—£“) | 
F n ) = E(d n E((£+ — £~) | d n ,J- n )) = 0. See Lemma[T|for the proof details that utilise (A4). 


• The step-size conditions in (A5) are standard stochastic approximation requirements, while the con¬ 
dition that Yh n (jr) < 00 is necessary to bound a certain martingale difference term that arises in 
the analysis of ©. See the proof of Theorem [2] 


(A 6 ) is a stability assumption required to ensure that © converges and is common to the analysis of 
stochastic approximation algorithms, which include simultaneous perturbation schemes (cf. Si, 
i). Note that (A 6 ) is not straightforward to show in many scenarios. Flowever, a standard trick to 
ensure boundedness is to project the iterate x n onto a compact and convex set - see the discussion on 
pp. 40-41 of [21] and also remark E.l of 0. 


We next present three results that hold for uniform as well as asymmetric Bernoulli perturbations: First, 
Lemma Q] establishes that the bias in the gradient estimates © and © is of the order Oidf ). Second, 
Theorem [2] proves that the iterate x n governed by © converges a.s. and finally, Theorem [3] provides a 
central limit theorem-type result. 


1 All norms are taken to be the Euclidean norm. 

d 3 f(x) 


" CKV f{x) = OsWQF 
for *!, *2, *3 = 1 , ..., 1 V. 


denotes the third derivate of / at x and Vf liai /(a:) denotes the (iit2J3)th entry of V 3 /(a:) 
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Lemma 1. (Bias in the gradient estimate) Under (A1)-(A6), for V/(.x' n ) defined according to either © or 
we have a.s. tha H 


E 


Vi/(x„) T n ~ Vi/(s n ) = 0(<5 2 ), for i = 1,..., N. 


(B) 


Proof. We use the proof technique of |29] (in particular, Lemma 1 there) in order to prove the main claim 
here. 

Notice that 


E 


'Vn-Vn 

T 

[ 2 5 n 

J n 


=E 

=E 


dr) 


dn 


f (%n A d n dri) f (%n 8 n d n ) 
f -|- 5 n d n ) f ( x n S n d n ) 

2L. 


T 

J n 

T 

J n 


+ E 


dr 7 


£+ — £“ 
2 <L 


T n 


The last equality above follows from (A2) and (A4). We now analyse the term on the RHS above for both 
uniformly distributed perturbations and asymmetric Bernoulli perturbations. 

Case 1: Uniform perturbations 

Let V 2 /(-) denote the Hessian of /. By Taylor’s series expansions, we obtain, a.s., 


<5, 2 


<5 3 


f(x n ± 8 n d n ) = f(x n ) ± 5 n d J n \7 f(x n ) + y dfS7 2 f(x n )d n ± y V 3 /(x+)(d™ <g> <g> d n ), 

where <g> denotes the Kronecker product and xf (resp. xf) are on the line segment between x n and (x n + 
5 n d n ) (resp. ( x n - 5 n d n )). Hence, 


E 


drr 


f (%n A 8 n dn) f (x n 8 n d n ) 


2 ( 5,1 


Tn 


=E \d n d\ V/(x n )| T n ]+E 

The first term on the RHS above can be simplified as follows: 


^d n (\/ 3 f(x+) + V 3 /(x“))(d n ® d n ® d n 


T 

J n. 


(9) 


E [d n d T n ] V/(x„) 




’ (4) 2 d l n d 2 n ■ 

1 


E 

44 (dn) 2 ■ 

■ 

v/(x n ) 


. dn d l n d% d 2 n ■ 

■ «) 2 J 



rff 

3 


( 10 ) 


In the above, the first equality follows from (A4) and the last equality in ( fTOl ) follows from E[(rf' n ) 2 ] = fir 
and E [df n dl] = E[e^]E[d£] = 0 for i / j. 

Now, the Zth coordinate of the second term in the RHS of © can be upper-bounded as follows: 


E 


§<(V 3 /(x+) + V 3 /(x-))(d„ ® dn < 8 . dn 
2 N N N 


T 

J n. 


<aodn J2 J2 J2 E 

*1=1 *2=1 *3=1 
aoS^^N 3 


< 


6 


(ID 


’Here Vif(x n ) and Vif(x n ) denote the ith coordinates in the gradient estimate V/( x„) and true gradient V f(x n ), respec¬ 
tively. 






































The first inequality above follows from (A 1), while the second inequality follows from the fact that \d l n \ < // , 
l = 1,..., N. The claim follows by plugging (fTOl) and (ITTh into © . 

Case 2: Asymmetric Bernoulli perturbations 

The proof follows in an analogous fashion as above, after noting that the scaling factor of in ([7]) 


cancels out E[(d^) 2 ] 


(1 + e) and the bound in (fTTb gets replaced by 


^ qq(5 2 (1 + e) 4 A r3 ^ 


□ 


Theorem 2. (Strong Convergence) Let x* be an asymptotically stable equilibrium of the following ordi¬ 
nary differential equation (ODE): xt = —V/(x*), with domain of attraction D(x*), i. e., D(x*) = {tco | 
lirn^oo x(t | ato) = x*}, where x(t \ Xo) is the solution to the ODE with initial condition xq. Assume 
(A1 )-(A6) and also that there exists a compact subset D of D(x*) such that x n € D infinitely often. Here x n 
is governed by © with the gradient estimate V/(xy,.) defined according to either © or ©. Then, we have 


x a.s. as n 


oo. 


Proof. As in the case of regular SPSA algorithm from [29D, the proof involves verifying assumptions A2.2.1 
to A2.2.3 and A2.2.4” of [21] in order to invoke Theorem 2.3.1 there. The reader is referred to Appendix lAl 
for the detailed proof. □ 


We now present an asymptotic normality result for 1RDSA, for which we require the following variant 
of (A3): 

(A3’) The conditions of (A3) hold. In addition, E(/+ — ff ) 2 —> o 2 a.s. as n —t oc. 

The main result is as follows: 


Theorem 3. (Asymptotic Normality) Assume (Al), (A2), (A3’), (A4)-(A6). Let a n = ao/n a and d n = 
do/n 7 , where ao, do > 0, a E (0,1] and 7 > 1/6. Let (3 = a —2y > 0 and P be an orthogonal matrix with 

PX7 2 f(x)P J = — diag (Ai,..., Xn). Then, 
a 0 

n^ 2 (i n — x*) ^4 M(p, PMP T ) as n —>• 00 , (12) 

where Af(p, PMP T ) denotes the multivariate Gaussian distribution with mean p and covariance matrix 
PMP T . The mean p is defined as follows: p = 0 if y > a /6 and p = fc M (ao<5o(2aoV 2 /(.T*) — (3 + L)~ l T) 
if 7 = a/ 6 , where 


= 


3.6 

2 (l + 6 )(l + (l + e) 3 ) 
(2 + e)(l + e ) 2 


for uniform perturbations, 

for asymmetric Bernoulli perturbations. 


In the above, I is the identity matrix of size N xN, /3 + = (3 if a = 1 and 0 if a < 1 and T = (T 1 ,... , T n ) t 
with 


rpL __ 

6 


V?„/(**) + 3 


N 

E 

i=l,i^l 


V 3 ul f(x* 


,l = 1 , 


, N. 


The covariance matrix M is defined as follows: 

M = i|r^(( 2A 1 - / 3+ )" 1 ’ ■ ■ ■ > ( 2A ^ - / 3+ ) -1 )- 
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Proof. Follows from Proposition 1 of jfx] after observing the following facts: 

3 9 

Uniform perturbations: —E [d n d J n \ = / and — E[(d£j 4 ] = 1.8 for any i = 1,.. 
Asymmetric Bernoulli perturbations: ———-E [d n df] = I and 


, N. 


1 


(i + 0 : 


rE IK) 4 ] = 


(l + e)(l + (l + e) 3 ) 
(2 + e)(l + e) 2 


(1 + c) 


for any i = 1,..., N. 


□ 


2.3 (Asymptotic) convergence rates 

The result in Theorem [3] shows that nf! 2 (x n — x*) is asymptotically Gaussian for 1RDSA under both 
perturbation choices. The asymptotic mean square error of rf^ 2 (x n — x*), denoted by AAAS£ \ -r;dsa ( a ■ c), 
is given by 

AMSSmvsA{aoi <5o) = fV + trace(PMP T ), 


where ao is the step-size constant, So is the constant in the perturbation sequence Sk and p, P and M are 
as defined in Theorem [3j Under certain assumptions (cf. 150), it can be shown that AAAS£ \ -r;dsa ( a • c) 
coincides with rfl E \\x n — x *\\ 2 . From Theorem [3 it is easy to deduce from the conditions on step-size 
exponent a and perturbation constant exponent 7 that the range of /3 is 0 to 2/3. Following the discussion 
in Section III-A of |@], a common value of f3 = 2/3 is optimal for all first-order algorithms, with a = 1 and 


7 = 1 / 6 . 

With step-size a n = ao/n, setting ao optimally requires knowledge of the minimum eigenvalue Ao of 
the Flessian V 2 /(x*), i.e., ao > /3/2 Aq- Under this choice, we obtain 


AMS£ m vsA-Unif(ao,6 0 ) = (3.6<5gao ||(2a 0 V 2 /(x*) - ^)~ 1 T ||) 2 

+ 5(/ 2 trace ((2a 0 V 2 /(x*) - /3) _ 1 5) , 


where T is as defined in Theorem [3] and S = —I. 

Remark 3. (On step-size dependency) Since Ao is unknown, obtaining the above rate is problematic and 
one can get rid of the dependency of ao on Ao either by averaging of iterates or employing an adaptive 
(second-order) scheme. The former would employ step-size a n = a 0 /n Q , with a € (1/2,1) and couple this 
choice with averaging of iterates as x n = 1 /n 1 x 'm- The latter adaptive scheme would correspond to 

2RDSA, which performs a Newton step to update x n in (1131) . Section \3\p resents 2RDSA along with an AMSE 
analysis that compares to 2SPSA. 

Comparing AMSE of lRDSA-Unif to that of 1SPSA 

Taking the ratio of AMSE of 1RDSA with uniform perturbations to that of 1SPSA with symmetric Bernoulli 
±l-valued perturbations, we obtain: 

AMS£mvsA-unif(ao, Ao) 

AMSSispsA( a o, S 0 ) 

_ (2S 2 ao ||(2a 0 V 2 /(%*) - p)~ l T\\ 1.8) 2 + apc^ti-ace ((2a 0 V 2 /(x*) - fS^S ) 

(2<5 2 a 0 ||(2a 0 V 2 /(x*) - /3)- 1 T||) 2 + a 0 ^ 2 trace ((2a 0 V 2 /(x*) - /3)-'S) 

2.24 

1 + (a 0 <5 0 " 2 ti-ace ((2 a 0 V 2 f(x*) - /3)~ 2 S)) / (2<5 2 a 0 ||(2a 0 V 2 /(x*) - P)~'T\\) 2 ' 
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From the above, we observe that ISPS A has a better AMSE in comparison to 1RDSA, but it is not clear 
if the difference is ‘large’. This is because the ratio in the denominator above depends on the objective 
function (via V 2 /(a+) and T) and a high ratio value would make the difference between 1RDSA and 1SPSA 
negligible. Contrast this with the 1.8 : 1 ratio obtained if one knows the underlying objective function (see 
Remark[4| below). 


Comparing AMSE of lRDSA-AsymBer to that of 1SPSA 

1 „ r , _ „ . 1 


Observe that 


(1 + 0 


E [d n df\ = I and so. 


(l + 0 : 


r E [«J ] = 


P 


(l + 0‘ 


for any i = 1,..., N. If we used 


U[— 1,1] r.v. for perturbations, then 9E[(djJ 4 ] = | = 1.8 and this value causes AMSE of 1RDSA to be 
much higher than that of ISPS A. 

On the other hand, choosing e = 0.01 for 1RDSA with asymmetric Bernoulli perturbations, we obtain 

1 -.,.-,41 P 


(1 + 0 


rl£[R) 4 ] = 


(1 + 0 


= 1.000099. Plugging this value into the AMSE calculation, we obtain: 


AMSE mVSA-AsymBer(ao, +)) 

AMSE isrsA( a o, do) 

_ (2tfa 0 || (2 a 0 V 2 f(x*) - P)~ l T\\ 1.000099) 2 + ao^trace ((2 a 0 V 2 f(x*) - /3)” 1 5) 

(2«5 2 a 0 ||(2a 0 V 2 /(x*) - P)~'T\\) 2 + aoc^trace (( 2 a 0 V 2 /(z*) - /3)" 1 5’) 

From the above, we observe that 1RDSA has an AMSE that is almost comparable to that of ISPS A. One 
could choose a small e to get this ratio arbitrarily close to 1. 

Remark 4. (1RDSA with Gaussian perturbations ) In /@/, the author simplifies the AMSE for 1RDSA 
by solving AMSEmx>SA{ a oAo) f or 6 o after setting ao optimally using Xq. Using N(0, 1) for d n and 
comparing the resulting AMSE of 1RDSA to that of first-order SPSA with symmetric Bernoulli distributed 
perturbations, they report a ratio of 3 : 1 for the number of measurements to achieve a given accuracy. Here 
3 is a result of the fact that for N{ 0,1) distributed d n , Ed 4 = 3, while l for SPSA comes from a bound on the 
second and inverse second moments, both of which are I for the Bernoulli case. Using U[—r], rj] distributed 
d n in 1RDSA would bring down this ratio to 1.8 : 1. However, this result comes with a huge caveat - that ao 
and 5 q are set optimally. Setting these quantities requires knowledge of the objective, specifically, V 2 /(.x*) 
and the vector T. 


3 Second-order random directions SA (2RDSA) 

As in the case of the first-order scheme, we present two variants of the second-order simulation optimization 
method: one using uniform perturbations and the other using asymmetric Bernoulli perturbations. Recall 
that a second-order adaptive search algorithm has the following form lioll : 

x n +i =x n - a n T(iT ri .) _1 V/(x n ), (13) 

H n = — —rH n -1 H- —~rH n . (14) 

n + 1 n + 1 

In the above, 

• V f(x n ) is the estimate of V fix n ) and this corresponds to © for the uniform valiant and (O for the 
asymmetric Bernoulli valiant. 

• H n is an estimate of the true Flessian V 2 /(-), with Hq = I. 


11 











• H n is a smoothed version of H n , which is crucial to ensure convergence. 

• T is an operator that projects a matrix onto the set of positive definite matrices. Update (fT4l) does not 
necessarily ensure that H n is invertible and without T, the parameter update (fT3l ) may not move along 
a descent direction - see conditions (C7) and (Cl 2) in Section [3721 below for the precise requirements 
on the matrix projection operator. 

The basic algorithm in (fl3l) - (fl4l) is similar to the adaptive scheme analyzed by PjJ. However, we use 
RDSA for the gradient and Hessian estimates, while 0] employs SPSA. 

Remark 5. (Matrix projection) A simple way to define T (/-/,,) is to first perform an eigen-decomposition 
of H n , followed by projecting all the eigenvalues onto the positive real line by adding a positive scalar 5 n - 
see mi . jjOffor a similar operator. This choice for T satisfies the assumptions (C7) and (C12), which are 
required to ensure asymptotic unbiasedness of the Hessian scheme presented in the next section. Note that 
the scalar S n used for T is also used as a perturbation constant for function evaluations (see (1151) below). 



Figure 2: Overall flow of 2-RDSA algorithm. 


3.1 Hessian estimate 

As illustrated in Fig. [2], we use three measurements per iteration in (fl3l) to estimate both the gradient and 
the Hessian of the objective /. These measurements correspond to parameter values x n , x n + 5 n d n and 
x n — 5 n d n . Let us denote these values by y n , yfi and y~ respectively, i.e., 

Un = f {%n) T £nj Vn = f(%n T 8 n d n ) + , y n = f(x n 8 n d n ) + £ n , (15) 

where the noise terms f n ■ ffi ■ satisfy + ff — 2f n H n ] = 0. We next present two constructions for 
the perturbations d n - one based on i.i.d. uniform r.v.s and the other using asymmetric Bernoulli r.v.s. Unlike 
the construction in ||J()|, which entails generating 2N Bernoulli r.v.s in each iteration, our construction 
requires N r.v.s that follow either a uniform or an asymmetric Bernoulli distribution. 
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Uniform perturbations 

Using the three measurements and the random directions obtained from d n , we form the Hessian estimate 
H n as follows: 


H n = 


2rf 


4 Mn 




where M n 


I (&) 2 - f) 
44 

d N d 1 




(16) 


Lemma [4] establishes that the above estimator is of order 0(4) away from the true Hessian. The first step 
of the proof is to use a suitable Taylor’s series expansion of / to obtain 


f (tEn T &ndri) T f(x n d n df) 2/(x n ) 

4 

N N—l N 

= Y^K) 2 v 2 uf(x n ) + 2Y^ E 44v 2 /(x„) + o(4). 

i— 1 i=l j=i +1 


Taking conditional expectations on both sides above, it can be seen that the RHS does not simplify to be 

9 

true Hessian and includes bias terms. By multiplying the term —with the RHS above, we obtain an 

2?7 4 

(asymptotically) unbiased Hessian estimate - see the passage starting from (fl9l ) in the proof of Lemma [4| 
below for details. 


Asymmetric Bernoulli perturbations 

Using the three measurements and the random directions obtained from d n , we form the Hessian estimate 
H n as follows: 


H n = M n 


Vn +Vn ~ 2 Vr 


where M„ = 


4 

£ (( 4) 2 - (1 + e )) 
1 ,/2 ,1\ 
2 (T+eP u n a n 
i ,/Ay/i 

2(1 +e) 2 U n 


1 ,/l ,/2 

2 (l+e)^ a n a n 

i(«) 2 -(i+<o) 

1 rl N rl 2 
2(1+<e) 2 “n u n 


(17) 


1 ,/l ,/iV 

2(l+e) 2 u n u n 
1 r /2 jiV 

2 (l+e)^ UnUn 

i («) 2 - (i + £)) 


where n = E «) J (l - with E(d?J a 

E(d i n )\i = 1.JV. 


(1 + E )(l + (1 + € ) 3 ) 

(2 + e) 


denoting the fourth moment 


Remark 6. (Need for asymmetry) The Hessian estimate that we construct is such that it disallows using 
symmetric Bernoulli r.v.s for perturbations. In particular, in establishing the unbiasedness of the Hessian 
estimate, the proof requires that the second and fourth moments of the perturbation r.v.s be different. Asym¬ 
metric Bernoulli r.v.s (with only a slight asymmetry) meet this condition and can be used in deriving a 
gradient estimate as well. 
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3.2 Main results 

Recall that T n = o(x m ,m < n ) denotes the underlying sigma-field. We make the following assumptions 
that are similar to those in [30]: 

(Cl) The function / is four-times differentiabl^ with | i4 /(at)| < oo, for ii,* 2 ,* 3,*4 = 1 ,... ,N 
and for all x £ 

(C2) For each n and all x, there exists a p > 0 not dependent on n and x, such that (x — x*) T f n (x) > 
p \\x n - x||, where f n (x) = T (H n )~ 1 'V f(x). 

(C3) {( n ,C,( n ",n = 1,2,...} satisfy E[£+ + £“ - 2^ n \F n ] = 0, for all n. 

(C4) Same as (A4). 

(C5) Same as (A5). 

(C 6 ) For each i = 1,..., N and any p > 0, 

P{{fni(x n ) > 0 i.o} n {fni(x n ) < 0 i.o} | {\x ni - x*\ > p Vn}) = 0 . 


(C7) The operator T satisfies <5 2 Y(fF n ) 1 — » 0 a.s. and E(\\Y(H n ) 1 || 2+< ’) < p for some (, p > 0. 
(C 8 ) For any r > 0 and nonempty S C }, there exists a p'(r, S) > t such that 


lim sup 

n—> oo 


Eies( x ~ x *)ifni(x) 


< 1 a.s. 


for all |(re — x*)i\ < r when i ^ S and |(x — x*)i\ > p'{r , S ) when i G S. 

(C9) For some ao, ai > 0 and for all n, E £ n 2 < ao, E ^ 2 < ao, E/(x „,) 2 < a\ and E f(x n ± S n d n ) 2 < 

Oil- 

(CIO) £n < OO. 

For a detailed interpretation of the above conditions, the reader is referred to Section III and Appendix B of 
[30]. In particular, (Cl) holds if the objective / is twice continuously differentiable with a bounded second 
derivative and (C2) ensures the objective / has enough curvature. (C3)-(C5) are standard requirements on 
noise and step-sizes and can be motivated in a similar manner as in the case of 1RDSA (see Section lT2l) . 
(C 6 ) and (C 8 ) are not necessary if the iterates are bounded, i.e., sup n ||.x' n | < oo a.s. (C7) can be ensured 
by having T defined as mentioned earlier, i.e., Y(A) performs an eigen-decomposition of A followed by 
projecting the eigenvalues to the positive side by adding a large enough scalar. Finally, (C9) and (CIO) are 
necessary to ensure convergence of the Hessian recursion, in particular, to invoke a martingale convergence 
result (see Theorem [ 6 ] and its proof below). 

Lemma 4. (Bias in Hessian estimate ) Under (Cl)-(CIO), with H n defined according to either (fl 6 l) or (fTTl) . 
we have a.s. tha\ 0 for i,j = 1,... ,N, 


E 


Hn(i, 3 ) 


Tn 


-V 2 /(x n ) =0(5 2 n ). 


(18) 


d 4 5 f (x 


4 Here X7 4 f(x) = 4 ; — denotes the fourth derivate of / at x and Vf. f(x) denotes the (iifaiaiAth entry of 

ox T ox T ox T ox T 4 

V 4 f(x), for h, * 2 , 13 , u = l,...,N. 

5 Here H n (i, j) and /(•) denote the (i, jf)th entry in the Hessian estimate H n and the true Hessian V 2 /(-), respectively. 
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From the above lemma, it is evident that the bias in the Hessian estimate above is of the same order as 
2SPSA of Q. 


Proof. ( Lemma 0 

Case 1: Uniform perturbations 

By a Taylor’s series expansion, we obtain 

f(x n ± 5 n d n ) =f(x n ) ± 5 n d T n Vf(x n ) + Y d n^ 2 f( x n)dn ± y V 3 /(x n )(<4 (8) d n <g> d n ) 

+ ^V 4 /(x+)(d n ®d n ®d n ® d n ). 

The fourth-order term in each of the expansions above can be shown to be of order 0(5%) using (Cl) and 
arguments similar to that in Lenmia[T|(see (fill) there). Hence, 

f {Xn + Mn) + f(Xn - S n d n ) - 2f(x n ) =rf r ^2 f + 

N N 

=EE« v oAM+o(*n) 

i=l j= 1 

JV JV-1 N 

= E«)M/(^) + 2 E E «V?y/(x n )+0(^). 

2=1 2=1 J=2+l 


Now, taking the conditional expectation of the Hessian estimate H„ and observing that E[£+ + f n — 2f n 
F n ] = 0 by (C3), we obtain the following: 

E[H n | F n ] = E 

Note that the 0(5%) term inside the conditional expectation above remains 0(5%) even after the multi¬ 
plication with M n . We analyse the diagonal and off-diagonal terms in the multiplication of the matrix M n 
with the scalar above, ignoring the 0(5%) term. 


JV—1 


Mn E (4) 2v «/(*r 


N 

i \ 


N 


2=1 


+ 22. E 

2=1 j=i-\-l 


+ 0(5% 


Fn 


(19) 


Diagonal terms in dl9l) : 

Consider the Ith diagonal term inside the conditional expectation in (fT9l ): 


JV-1 JV 

(x n )+2^ ]r di,d!Xj,f( 

2=1 J= 2 +l 

A - N . N-l N 

--T-M) 2 E E 

^ i= 1 9 i=i j=f+ 

JV 


§ji X - y) (E(<) 2 v?,/( a 

45 

(rr X trl* 1*V?. fl'T- 1 4- 

2r] 4 



d%diV% j f(x n ) 


( 20 ) 


2=1 ,7=2+1 

AT AT—1 AT 

- S e«) 2 v!/(x„) -See <divy(x n ) 

1 i =1 ' i=l y=i+l 

From the distributions of ri’ n . dh and the fact that df is independent of dn for z < j, it is easy to see that 

/ JV—1 JV . \ (N-l N 

E «) 2 E E d%dX% j f(x n ) Fn = o ! - ~ " " 

\ *=i i=*+i / 

conditional expectations of the second and fourth 


\ / JV-l At 

F n = 0 and E E E d%diV% j f(x n ) 

I V 2=1 ,7=2+1 

terms on the RHS of (l20l ) are both 


F n J = 0. Thus, the 
zero. 
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The first term on the RHS of (l20l) with the conditional expectation can be simplified as follows: 


45 


N 


4 ^e( (4) 2 E(4) 2v ?i/(^ 


2=1 


Fn I =S E I + E (4) 2 (4) 2 v|/(^ 

4 4 -N 

5 I T v »^"> + T E 


N 


4 rj 4 


a.s. 


2 = 1 , 


For the second equality above, we have used the fact that E[(d^) 4 ] = and IE [{d l n ) 2 (d l n ) 2 ] = E[(d z ra ) 2 ]E[(d5 1 ) 2 ] 

—, V/ / i. 

9 

The third term in (1201 ) with the conditional expectation and without the negative sign can be simplified 
as follows: 


N 




X r , 


15 


N 




N 


a - s - 


2=1 


Combining the above followed by some algebra, we obtain 

N N—l N 


_45 
4 1] 4 


E 


(4) 2 -y) (E( t C)M/W + 2 E E 44V?-/(a 

2 i=l i=l j=i+l 


-T'n 


= V uf{x n ), a.s. 


Off-diagonal terms in (fl9l) : 

We now consider the (/c, Z)th term in ( fl9l ): Assume w.l.o.g that k < l. Then, 


2r] 4 


rE 


N 


N N—l N 

d k n d i n [^(d i n ) 2 vy(x n )+2j2 E <diyif{x n ) 

. 2=1 2=1 J=2+l 

N—l N 


T 

J n. 


=AE E (44K.) 2 ) v2 i /(^) + | r E E E (4444) v?-/(a 

'' i= 1 '' i=l j=»+i 

=V^/(X n). 


( 21 ) 


The last equality follows from the fact that the first term in (l2Tb is 0 since k ^ l, while the second term in 

9 

(ED can be seen to be equal to —E ({d^) 2 (d l n ) 2 ) V|;/(x n ) = V| z /(x n ). The claim follows for the case of 
uniform perturbations. 


Case 2: Asymmetric Bernoulli perturbations 

Note that the proof up to (fl9l ) is independent of the choice of perturbations. The proof differs in the 
analysis of the diagonal and off-diagonal terms in (fl9l ). In the case of asymmetric Bernoulli perturbations, 
the normalizing scalars in the definition of M n in (fT71) are different. 
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Diagonal terms in (fT9l) 

(l + 6 )(l + (l + 6 ) 3 ) 


Let 0 = 


(l 20 l) is as follows: 


(2 + e) 


denote the fourth moment E(d l n Y, for any i = 1,..., N. An analogue of 


El ( (dl) 2 - 


N 


E ( 4 ) 2 ^«) 2 V|/(x, 


2=1 


N N—l N 

(l + e)) [EKJM./(*n) + 2J2 E «V 2 /(x, 

„ 2 =1 i—1 j=i-\-1 

^ I-(E( d n) 2 v 2 ,/( a 




. 2=1 


n. 


Tn (22) 


We have used the fact that the second term in the LHS above is conditionally zero (see argument below (l20l) 
for a justification). The first term on the RHS of (f22l) be simplified as follows: 


1 


N 


0( i - AA) 
4a-AA) 




i=l 


T 

J n. 


N 


( 1 - 


( 1 +e ) 2 


E (0 4 V?,/(a:„) + E 00 2 (4) 2 V?i/(*„) 

v?,/(*„) + ^-^ e v ^/(^ 

2 = 1 , 2 ^Z 


(23) 


For the second equality above, we have used the fact that E[(d^) 4 ] = cj) andE^e?^) 2 ^) 2 ] = E[(dJ l ) 2 ]E[(dj l ) 2 ] 
(1 + e) 2 , V/ i. 

The second term in (1221) with the conditional expectation and without the negative sign can be simplified 
as follows: 


(1 + ^) 


N 


0(1 


-E ]T«) 2 vS/(*, 


„ 2=1 


Tn = 


Lii) £e[(<4 ) 2 ] vj/d. 


0(1 tt 


N 


- ( 1 + e ) 2 Vv 2 f( x ) 

Combining (1231) and (l24l) . the correctness of the Flessian estimate follows for the diagonal terms. 


(24) 


Off-diagonal terms in (fl9l) 

Consider the (A:, Z)th term in (fl9l) . with k < l. We obtain 

( N N -1 N 

E«) 2 v|/(x ri ) + 2 E E 44v?-/(®, 

2=1 2=1 J=2+l 

N—l N 

E E (dUUO 2 ) V 2 /(*„) + T—- -2 E E E {d k n d l n<di) V?-/(*n) (25) 

1 + i=l j=i+l 


2(1 + e ) 2 


E 


+4 


JV 


" 2 ( 1 + e ) 2 ^ 


2=1 


=V^/(x, 


Note that the first term on the RHS of (1251) equals zero since k ^ l. The claim follows for the case of 
asymmetric Bernoulli perturbations. □ 
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Theorem 5. (Strong Convergence of parameter) Assume (C1)-(C8). Then x n —> x* a.s. as n —> oo, where 
x n is given by d- 

Proof. From Lemma[T] we observe that the gradient estimate V/(.x>J in (fT3l) satisfies: 


E 


V/(s fl 


Tn 


v/(s„) + A», 


where the bias term \i n is such that S n 2 \\fi n \\ is uniformly bounded for sufficiently large n. The rest of 
proof follows in a manner similar to the proof of Theorem la in [30 E 


the 

□ 


Theorem 6 . (Strong Convergence of Hessian) Assume (Cl)-(CIO). Then H n —>• V 2 /(x*) a.s. as n —>• oo, 
where //,, is governed by d and H n defined according to either d or d- 


Proof. We first use a martingale convergence result to show that 


l 

71+1 


E n 
m— 


H m ~ E 

//, 


Hn 


_L_ V” F 

ra+l 2Em=0 ^ 

the detailed proof. 


V 2 /+ 


0 a.s. Next, using Lemma [4] we can conclude that 

a.s. and the claim follows. The reader is referred to Appendix iBl for 

□ 


We next present a asymptotic normality result for 2RDSA under the following additional assumptions: 

(Cll) For some C, « 0 ) cti > 0 and for all n, E£ n 2+ ^ < op* E+ 2+< " < a 0 > E/(;r ri ) 2+< ’ < a\ and E f(x n ± 
<Wn) 2+C < ai. 

(C12) The operator T is chosen such that ||T(iT n ) — H n || —> 0 a.s. as n —>• oo. 

Assumption (Cll) is required to ignore the effects of noise, while (C12) together with Theorem [ 6 ] ensures 
that T (H n ) converges to the true Flessian a.s. It is easy to see that the choice suggested in Remark[5]for T 
satisfies (C 12 ). 

The main result is as follows: 

Theorem 7. (Asymptotic Normality) Assume (C1)-(C12) and that V 2 /(x *) -1 exists. Let a n = ao/n ,y and 
bn = <W n7 > where ao,<5o > 0, a € (0,1] and 7 > 1/6. Let (3 = a — 2y. Let E(<+ — +) 2 —> o 2 as 
n —> 00 . Then, we have 

n^ 2 (x n — x*) Af(p, fi) as n — > 00 , (26) 

where Af(p, f l) is the multivariate Gaussian distribution with mean // and covariance matrix Q. The 
mean p is defined as follows: p = 0 (an N-vector of all zeros) if 7 > a /6 and p = +( 0 + 0 ( 20.0 — 
/3 + ) _ 1 V 2 / (x*)~ 1 T) if 7 = a/ 6 , where 


= 


3.6 

2(1 + e)(i + (i + 7 ) 
(2 + e)(1+0 2 


for uniform perturbations, 

for asymmetric Bernoulli perturbations. 


In the above, T and fi + are as defined in Theorem\3\ The covariance matrix Q is defined as follows: 


n = 


a o ^ 2 

4<5 0 V(8a 0 - 4/3+) 


(vv+r 1 ) 2 . 


Proof. As in the case of 2SPSA of |30ll, we verify conditions (2.2. l)-(2.2.3) of |@] to establish the result and 
the reader is referred to Appendix [B|for details. □ 


'’Note that the proof of Theorem [5] does not assume any particular form of the Hessian estimate and only requires assumptions 
(C1)-(C7), which are similar to those in & The only variation in our case, in comparison to ^1, is the gradient estimation uses 
an RDSA scheme while 13011 uses first-order SPSA. Thus, only the first step of the proof differs and in our case, Lemma[T]controls 
the bias with the same order as that of SPSA, leading to the final result. 
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3.3 (Asymptotic) convergence rates 

Recall from Theorems [3] and [7] that we set a n = ao/n a and S n = Sq/ti 1 , where ao,<5o > 0, a € (0,1] 
and 7 > 1/6. Let AMS£ 2 nvsA-Unif(ao , <5o) and AMS£ 2 TiDSA-AsymBer(ao, <5o) denote the AMSE for the 
uniform and asymmetric Bernoulli valiants of 2RDSA, respectively. These quantities can be derived using 
Theorems [3] and [7] as follows: 

AMS£2TivsA-Unif(ao,8o) = ||(^ 2 fi x *)~ 1 T\\'j 

+ «2^) M “( v2 /(,r'sv7(,*)-7 

AMS£ 2 tivsa —AsymBer(O j 0 ? * ) = ((i+!)4“°-/3) ll (v2/(irl dl) 

+ ^73) M “( v2/(l ‘ rIsv2/(l * )_1 )- 

where T is as defined in Theorem [3] and S = ^/. 

Recall from the discussion in Section [23] that 1RDSA has a problem dependence on the minimum 
eigenvalue Ao of V 2 /(x*) for the step-size constant a o- One can get rid of this dependence by using one of 
the 2RDSA variants and ao = 1. An alternative is to use iterate averaging, whose AMSE can be shown to 
be: 

AMS£ m vsA-Av g (5 0 ) = ||(V 2 /(.t*)- 1 T||) + ^^y trace (V 2 /(z*)- 1 SV 2 /(x*)" 1 ) . 

Notice that with either variant of 2RDSA one obtains the same rate as with iterate averaging and both these 
schemes do not have dependence on Ao- Moreover, using arguments similar to [tJ] (see expressions (5.2) 
and (5.3) there), we obtain 


VSo,AMS£2-R.vsA-UmfO-,8o) < 2 min AMS£mvsA( a oAo), 

a o>0/ ( 2 A 0 ) 

V(5o, AMS£ 2 nvsA-AsymBeri)-, ^o) < 2 min AMS£mvsA{ a oAo)- 

ao>0/(2\o) 


Note that the above bound holds for any choice of So. Thus, 2RDSA is a robust scheme, as a wrong choice 
for ao would adversely affect the bound for 1RDSA, while 2RDSA has no such dependence on ao- 


Remark 7. ( Iterate averaging ) Only from an “asymptotic” convergence rate viewpoint is it optimal to use 
larger step-sizes and iterate averaging. Finite-sample analysis (Theorem 2.4 in t \l'Al ) shows that the initial 
error (which depends on the starting point of the algorithm) is not forgotten sub-exponentially fast, but at 
the rate 1/n, where n is the number of iterations. Thus, the effect of averaging kicks in only after enough 
iterations have passed and the bulk of the iterates are centered around the optimum. See Section 4.5 in Mill 
for a detailed discussion on this topic. 


Comparing 2RDSA-Unif vs 2SPSA. 

Taking the ratio of AMSE of 2RDSA with uniform perturbations to that of 2SPSA, we obtain: 


AMS£2nvsA-unif{^^ <5q) _ 3.24(A) + (B) 
AMS£2svsa(1^ <5o) (A) + (B) 


(27) 
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(28) 


^) = (^IK v2 /^T 1 r||) , 

(B) trace {V 2 f (x*)- 1 SV 2 f (x*)- 1 ) ■ (29) 

However, 2SPSA uses four system simulations per iteration, while 2RDSA-Unif uses only three. So, in 
order to achieve a given accuracy, the ratio of the number of simulations needed for 2RDSA-Unif (denoted 
by W 2 RDSA-Unif) to that for 2SPSA (denoted by tUspsa) is 

n'iiZ'DSA—Unif _3 ^ AMS£ 2 'R.vsA-Unif(,l,So) _ 3 x 3.24(A) + (H) _ 5.72(A) — (B) 

fi2SVSA 4 AMS£ 2 svsa{^-Ao) 4 (A) + (H) 4(A) + 4(H) 

Thus, if 5.72(A) — ( B ) < 0, 2RDSA-Unif’s AMSE is better than 2SPSA. On the other hand, if 5.72(A) — 
(H) ~ 0, 2RDSA-Unif is comparable to 2SPSA and finally, in the case where 5.72(A) — (H) > 0, 2SPSA 
is better, but the difference may be minor unless 5.72(A) >> (H), as we have the term 4(A) + 4(H) in the 
denominator above. Note that the quantities (A) and (H) are problem-dependent, as they require knowledge 
of V 2 /(z*) andT. 

Unlike the first-order algorithms, one cannot conclude that 2SPSA is better than 2RDSA-Unif even when 
V 2 /(x'*j and T are known. 2RDSA-Unif uses fewer simulations per iteration, which may tilt the balance 
in favor of 2RDSA-Unif. We next show that asymmetric Bernoulli distributions are a better alternative, as 
they result in an AMSE for 2RDSA schemes that is lower than that for 2SPSA on all problem instances. 

Comparing AMSE of 2RDSA-AsymBer to that of 2SPSA. 

Taking the ratio of AMSE of 2RDSA with asymmetric Bernoulli perturbations to that of 2SPSA, we obtain: 

o2 

AMS£ 2 R.VSA-AsymBer{^Ao) _ (1+e) 2 ^ 

AMS£ 2 spsa(1Ao) (21) + (H) ’ 

where (A) and (H) are as defined in (I28l) -(l29l). However, 2SPSA uses four system simulations per iteration, 
while 2RDSA-AsymBer uses only three. So, in order to achieve a given accuracy, the ratio of the number 
of simulations needed for 2RDSA-AsymBer (denoted by fURDSA-AsymBer) to that for 2SPSA (denoted by 
«2SPSa) is 

n2TZVSA-AsymBer 3 AMS£2'R.VSA-AsymBer (1 ^0) _ 3 (1+e) 2 (^) + (B) 

n 2 svsA 4 AMS£2svsaO-Ao) 4 (A) + (H) 

For the sake of illustration, we set e = 0.01 in the asymmetric Bernoulli distribution © to obtain 

^2'R.VSA-AsymBer _ 3.0000057(A) + 3(H) ^ 

^2SVSA 4(A) + 4(H) 

Thus, 2RDSA with asymmetric Bernoulli distribution AMSE is clearly better than 2SPSA on all problem 
instances, as (A) and (H) are positive (albeit unknown) quantities. 

4 Numerical Experiments 

4.1 Setting 

We use two functions, both in A^ = 10 dimensions for evaluating our algorithms. 
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Quadratic function: Let A be such that NA is an upper triangular matrix with each entry one and b is the 
iV-dimensional vector of ones. Then, the quadratic objective function is defined as follows: 

f(x) = x T Ax + b J x, (30) 

The optimum x* for / is such that each coordinate of x* is —0.9091, with f(x*) = —4.55. 

Fourth-order function: This is the function used for evaluating the second-order SPSA algorithm in |20.] 
and is given as follows: 

N N 

f(x) = x T A T Ax + 0.1 ^2(Ax)^ + 0.01 ^^(Ax)j, (31) 

3 = 1 3 = 1 

where A is the same as that in the case of quadratic loss. The optimum x* = 0 with fix*) = 0. 

For any x, the noise is [x T , 1 \z, where 2 is distributed as a multivariate Gaussian distribution in 11 dimensions 
with mean 0 and covariance cr 2 /n x ii- As remarked in [30], the motivation for this noise structure is to have 
most of the noise components depend on the iterate x n and also a component z that ensures that the variance 
is at least cr 2 . 


4.2 Implementation 

We implement the following algorithmic 

First-order: This class includes the RDSA schemes with uniform and asymmetric Bernoulli distributions 
- lRDSA-Unif and lRDSA-AsymBer, respectively, and regular SPSA with Bernoulli perturbations - 
ISPS A. 


Second-order: This class includes 2RDSA-Unif and 2RDSA-AsymBer - the second-order RDSA schemes 
with uniform and asymmetric Bernoulli distributions, respectively, and also regular second-order 
SPSA with Bernoulli perturbations - 2SPSA. 


For 1RDSA and 1SPSA, we set 5 n = 1.9/n 0101 and a n = 1 /{n + 50). For 2RDSA and 2SPSA, we 
set 5 n = 3.8/n 0101 and a n = 1/n 0 6 . These choices are motivated by standard guidelines - see Bill . For 
uniform perturbation variants, we set the distribution parameter rj = 1 and for the asymmetric Bernoulli 
valiants, we set the distribution parameter e as follows: e = 0.0001 for lRDSA-AsymBer and e = 1 for 
2RDSA-AsymBer. These choices are motivated by a sensitivity study with different choices for e - see 
Tables [6aT - [6b1 in Appendix [C] For all the algorithms, the initial point xo is the N = 10-dimensional vector of 
ones. To keep the iterates stable, each coordinate of the parameter 6 is projected onto the set [—2.048, 2.047]. 
All results arc averages over 1000 replications. 

We use normalized mean square error (NMSE) as the performance metric for comparing algorithms. 
This quantity is defined as follows: 

I kn end -Z*f 


NMSE = 


IN - x* 


where x Ucni is the algorithm iterate at the end of the simulation. Note, n en d is algorithm-specific and a 
function of the number of measurements. For instance, with 2000 measurements, n = 1000 for both 
1SPSA and both 1RDSA variants, as they use two measurements per iteration. On the other hand, for 
2SPSA and both 2RDSA variants, an initial 20% of the measurements were used up by 1SPSA/1RDSA 


7 The implementation is available at https : //github . com/prashla/RDSA/archive/master . zip 
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Table 2: NMSE for quadratic objective (l30l) and noise parameter a = 0.001: 
standard error from 1000 replications shown after ± 


First-order Algorithms 

No. of function 

measurements 

1SPSA 

lRDSA-Unif 

lRDSA-AsymBer 

1000 

4.15 X 10“ 2 ± 5.15 X 10“ 4 

4.53 x 10" 2 ±5.72 x 10~ 4 

4.18 x 10“ 2 ±5.41 x 10" 4 

2000 

3.42 x 10“ 2 ±4.68 x 10“ 4 

3.67 x 10 -2 ± 5.28 x 10~ 4 

3.38 X 10“ 2 ± 4.84 X 10“ 4 

Second-order Algorithms 

No. of function 

measurements 

2SPSA 

2RDSA-Unif 

2RDSA-AsymBer 

1000 

1.05 x 10 -3 ± 2.25 x 10" 5 

9.61 x 10" 5 ± 2.48 x 10~ 6 

8.39 X 10- 5 ± 2.25 X 10“ 6 

2000 

3.60 x 10 -6 ± 7.62 x 10" 8 

4.48 x 10" 6 ± 6.61 x 10~ 8 

2.24 X IQ -6 ± 3.35 X 10“ 8 


and the resulting iterates were used to initialize the corresponding second-order method. Thus, with 2000 
measurements available, the initial 400 measurements are used for 1SPSA/1RDSA and the remaining 1600 
are used up by 2SPSA/2RDSA. This results in n en d of 1600/4 = 400 for 2SPSA and 1600/3 ~ 533 for 
2RDSA algorithms. Note that the difference here is due to the fact that 2RDSA uses 3 simulations per 
iteration, while 2SPSA needs 4. 

4.3 Results: Quadratic objective 

Tables [2]-[3] present the normalized mean square error (NMSE) for both first and second-order algorithms 
with quadratic objective (l30l ) and noise parameter a set to 0.001 and 0. 

Observation 1: Among first-order schemes, lRDSA-AsymBer performs on par with 1SPSA, while 1RDSA- 
Unifis sub-par. 

The NMSE of lRDSA-AsymBer is comparable to that of 1SPSA, while lRDSA-Unif results in a higher 
NMSE. This is consistent with the asymptotic rate results discussed earlier in Section [23] 

Observation 2: Second-order schemes outperform their first-order counterparts, and 2RDSA-AsyrnBer 
performs best in this class. 

The first part of the observation is consistent with earlier results for the low noise regime (i.e., a = 0.01), 
for instance, see H. Further, the gains of using second-order schemes are more noticeable in the zero-noise 
regime (see Table [3]). Moreover, 2RDSA-AsymBer results in the best NMSE. In fact, running 2RDSA- 
AsymBer for 400 iterations, which was the number used for 2SPSA with 2000 measurements available, the 
resulting NMSE values was found to be 2.34 x 10 -6 , which is better than the corresponding 400-iteration 
result of 3.60 x 10 -6 for 2SPSA (see Table [2]) , while using only 75% as many simulations. 

4.4 Results: Fourth-order objective 

Table 0] presents results similar to those in Table |2] for the fourth-order objective function (I3TI) with the 
noise parameter a set to 0.001. In addition, we also present the normalized function values in Table [5] 
The normalized function value is defined as the ratio /(x nend )//(xo). Considering that the fourth-order 
objective is more difficult to optimize in comparison to the quadratic one, we run all algorithms with a 
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Table 3: NMSE for quadratic objective (l30l) and noise parameter a = 0: 
standard error from 1000 replications shown after ± 


First-order Algorithms 

No. of function 

measurements 

1SPSA 

lRDSA-Unif 

lRDSA-AsymBer 

1000 

4.15 X 10“ 2 ± 5.15 X 10“ 4 

4.53 x 10 -2 ± 5.72 x 10~ 4 

4.18 x 10” 2 ±5.41 x 10" 4 

2000 

3.42 x 10“ 2 ±4.68 x 10“ 4 

3.67 x 10 -2 ± 5.28 x 10~ 4 

3.37 X 10“ 2 ± 4.87 X 10“ 4 

Second-order Algorithms 

No. of function 

measurements 

2SPSA 

2RDSA-Unif 

2RDSA-AsymBer 

1000 

7.57 x 10 -4 ± 1.59 x 10" 5 

9.34 x 10" 5 ± 2.49 x 10“ 6 

8.27 X 10“ 5 ± 2.25 X 10“ 6 

2000 

6.77 x 10 -7 ± 2.78 x 10" 8 

2.42 x 10" 9 ± 1.11 x 10~ 10 

2.90 X 10“ 9 ± 1.41 X lO” 10 


simulation budget of 10000 function evaluations. From the results in Tables [DEU one can draw conclusions 
similar to that in observations 1 and 2 above, except that 2RDSA-Unif shows the best performance among 
second-order schemes. 


Remark 8. (Enhanced 2SPSA) In mi enhancements to the 2SPSA algorithm incorporated adaptive feed¬ 
back and weighting to improve Hessian estimates. However, preliminary numerical experiments that we 
conducted for enhanced 2SPSA with the parameters recommended in M indicate that the benefits of such 
a scheme kick in only after a large number of iterations. For instance, for the fourth-order objective function 
<ED. running the enhanced 2SPSA algorithm resulted in a high NMSE and the latter became comparable to 
that of 2SPSA after increasing the simulation budget of 2000. 

Finally, the numerical results presented in Tables [DEI make a fair comparison in the sense that, except 
the perturbations every other parameter (e.g.„ step-sizes a n , perturbation constants 5 n , initial point xq) is 
kept constant across algorithms in each class ( first/second-order). The results demonstrate that it is indeed 
advantageous to use uniform/asymmetric Bernoulli perturbations. It would be interesting future work to 
enhance 2RDSA schemes to improve the Hessian estimates along the lines of Mi and then numerically 
compare the performance of enhanced 2RDSA schemes with that of enhanced 2SPSA. 


5 Conclusions 

We considered a general problem of optimization under noisy observations and presented the first adaptive 
random directions Newton algorithm. Two sets of i.i.d. random perturbations were analyzed: symmetric 
uniformly distributed and asymmetric Bernoulli distributed. In addition, we also presented a simple gradient 
search scheme using two sets of perturbations. While our gradient search scheme requires the same num¬ 
ber of perturbations and system simulations per iteration as the simultaneous perturbation gradient scheme 
of a our Newton scheme only requires half the number of perturbations and three-fourths the number 
of simulations as compared to the simultaneous perturbation Newton algorithm of |!3(J]. We proved the 
convergence of our algorithms and analyzed their rates of convergence using the asymptotic mean square 
error (AMSE). From this analysis, we concluded that the asymmetric Bernoulli perturbation variants exhibit 
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Table 4: NMSE for fourth-order objective (|3TT) and noise parameter cr = 0.001: 
standard error from 1000 replications shown after ± 


First-order Algorithms 

No. of function 

measurements 

1SPSA 

lRDSA-Unif 

lRDSA-AsymBer 

2000 

1.37 x 10” 1 ± 1.39 x 10" 3 

1.38 x 10" 1 ± 1.33 x 10" 3 

1.35 X 10- 1 ± 1.36 X 10- 3 

10000 

1.14 X IQ -1 ± 1.14 X 10“ 3 

1.18 x 10” 1 ± 1,23 x 10~ 3 

1.14 X IQ -1 ± 1.23 X 10“ 3 

Second-order Algorithms 

No. of function 
measurements 

2SPSA 

2RDSA-Unif 

2RDSA-AsymBer 

2000 

3.2 x 10" 2 ±5.38 x 10 -4 

1.48 X 10“ 2 ± 2.64 X 10“ 4 

4.89 x 10 -2 ± 9.01 x 10" 4 

10000 

1.01 x 10 -2 ± 1.96 x 10 -4 

1.74 X 10“ 3 ± 3.65 X IQ” 5 

6.45 x 10" 2 ± 1.48 x 10" 3 


Table 5: Normalized function values for fourth-order objective (|3TI) and noise parameter a = 0.001: stan¬ 
dard error from 1000 replications shown after ± 


First-order Algorithms 

No. of function 

measurements 

1SPSA 

lRDSA-Unif 

lRDSA-AsymBer 

2000 

9.8 x 10" 3 ± 1.01 x 10~ 4 

1.1 x 10" 2 ±1,01 x 10" 4 

9.6 X 10“ 3 ± 1.01 X 10“ 3 

10000 

6.1 X 10“ 3 ± 6.96 X It)” 5 

6.3 x 10“ 3 ± 7,27 x 10- 5 

6.1 X 10“ 3 ± 7.27 X 10“ 5 

Second-order Algorithms 

No. of function 

measurements 

2SPSA 

2RDSA-Unif 

2RDSA-AsymBer 

2000 

2.55 x 10~ 3 ± 3.35 x 10" 5 

2.17 X 10“ 4 ± 1.66 X 10-® 

1.97 x 10~ 3 ± 1.83 x 10~ 4 

10000 

7.62 x 10" 4 ±1.1 x 10~ 5 

4.41 X 10“ 5 ± 4.42 X 10“ 6 

1.54 x 10~ 3 ± 1.45 x 10~ 4 
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the best AMSE for both first- and second-order RDSA schemes. Furthermore, our numerical experiments 
show that our Newton algorithm requires only 75% of the number of function evaluations as required by the 
Newton algorithm of [30 ] while providing the same accuracy levels as the latter algorithm. 

As future work, we outline two possible directions. First, it would be interesting to use the approach of 
1320 to arrive at an adaptive scheme that incorporates a feedback term to improve the quality of the Hessian 
estimate. Second, it would be of interest to extend our algorithms to scenarios where the noise random 
variables form a parameterized Markov process and to develop multiscale algorithms in this setting for 
long-run average or infinite horizon discounted costs. Such algorithms will be of relevance in the context of 
reinforcement learning, for instance, as actor-critic algorithms. 


Appendix 


A Proofs for 1RDSA 


Proof of Theorem [2] 

Proof. We first rewrite the update rule ([4]) as follows: 


%n +1 — x n f(%n) T Vn T Pn)i 


(32) 


where rj n = V/(x n ) — E (Vf(x n ) \ T n ) is a martingale difference error term and j3 n = E(V/(x n ) | 
J~n) — V/(.x n ) is the bias in the gradient estimate. Convergence of (l32l) can be inferred from Theorem 
2.3.1 on pp. 39 of [21], provided we verify that the assumptions A2.2.1 to A2.2.3 and A2.2.4” of 0 are 
satisfied. We mention these assumptions as (B1)-(B4) below. 


(B1) V/ is a continuous M iV -valued function. 

(B2) The sequence 3 n , n > 0 is almost surely bounded with 3 n —r 0 almost surely as n —> oo. 
(B3) The step-sizes a n ,n> 0 satisfy 


a(n) —> 0 as n —> oo and 


a n = oo. 


(B4) {rj n , n > 0} is a sequence such that for any e > 0, 



The above assumptions can be verified for (l32l) as follows: 

• (Al) implies (Bl). 

• (A5) together with ffTTb in the proof of Femma Q] imply that the bias (3 n is almost surely bounded. 
Further, Femma [I] implies that 3 n is of the order 0(5%) and since 5 n —> 0 as n —> oo (see (A5)), we 
have that f3 n —> 0. Thus, (B2) is satisfied. 

• (A5) implies (B3). 
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We now verify (B4) using arguments similar to those used in Chapter 7.3 of |31]: We first recall a 
martingale inequality attributed to Doob (also given as (2.1.7) on pp. 27 of [21]]): 


m> 0 


l 


P sup ||Wm|| > e < ~2 lim E ||W ? 


^ m—>-oo 


(33) 


We apply the above inequality in our setting to the martingale sequence {W n }, where W n := W'' =(| ' 
n > 1 , to obtain 


P sup 

V m>n 


E 




^ e ) ^ E 


E 


CLiTH 


1 


IXeii* 


(34) 


The last equality above follows by observing that, for m < n, K(r] m r] n ) = E(p m E(p n | P n )) = 0. 

Using the identity E ||A — E[X | 7 7 ra ]|| 2 < E ||A ^|| 2 for any random variable X, we bound E \\r] n 
as follows: 


E ||pn || 2 <-/VE (rf n ) 2 for some z e TV} 


N 9 

= T?2 E { d n(Vn ~Vn)) 

N ^2\V2 


2 \ 1/2 


<4^(( E tot) 2 ) +(e(<4^) 2 ) 


N 


[ E ((^) 2+2P1 )] ^ E K)] 


■4 5 2 

Q 

<-=?, for some C < 00 . 


| 2+2p2 


1+P2 


+ 


IE K)] 


— \"i 2+2p2 


1+P2 


(35) 

(36) 

(37) 

(38) 


The inequality in (l36l) follows by the fact that E (X+Y) 2 < ((EA 2 ) 1 / 2 + (ET' 2 ) 1 / 2 )". The inequality 
in (l37l) uses Holder’s inequality, with p\. P 2 > 0 satisfying = 1. The inequality in (l20l) 

follows from (A3) and the fact that the perturbations d n have finite moments. 

Plugging (l20l) into (l34l) . we obtain 


lim P sup 

n ^>°° \ m>n 




00 9 

af 


> e I < 3 lim y = 0 . 

n—>00 zJ n . 


The equality above follows from the the fact that V n ( f 21 ) <00 (see (A5)). 


The claim follows from Theorem 2.3.1 on pp. 39 of [21]. 


□ 


B Proofs for 2RDSA 

Proof of Theorem [6] 


Proof. The proof proceeds in exactly the same manner manner as the proof of Theorem 2a in [30]. For the 
sake of completeness, we sketch below the main arguments involved in the proof. 


Let Wm = Hm — E 


Hr, 


. Then, we know that ElU m = 0. In addition, we have 

2 


E||W m || 2 


< OO. 


The latter follows by first observing that E 


Hr, 


m m z 

< 00 , Vm uniformly as a consequence of (C9) and 
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then coupling this fact with (C8). Now, applying a martingale convergence result from p. 397 of [22] to 
W m , we obtain 


1 


n + 1 


m =0 


From Proposition [4] we know that E 
1 


^ ' yH m E H m x m J y 0 a.s. 

= V 2 /(x n ) + 0(<5 2 ). 


(39) 


H r , 


n + 1 


E E 


H rt 


1 


n + 


y E ( v 2 /(^-)+°(<t)) v2 /(^) a - s - 


m=0 m =0 

The final step above follows from the fact that the Hessian is continuous near x n and Theorem [5] which 
implies x n converges almost surely to x*. Thus, we obtain 


1 


n + 


Y E -»• V 2 /(x*) a.s. 


m =0 


and the claim follows by observing that H m = -—-—j- ^m- 


□ 


Proof of Theorem [7] 


Proof. We use the well-known result for establishing asymptotic normality of stochastic approximation 
schemes from [9|]. As in the case of SPSA-based algorithms (cf. a . iBoh ). for 1RDSA, it can be shown 
that, for sufficiently large n, there exists a x n on the line segment that connects x n and x*, such that the 
following holds: 

E[V/(x„) I x n ] = V 2 f(x n )(x n - X*) + fin, 


where /3 n = E(V/(x n ) | x n ) — Vf(x n ) is the bias in the gradient estimate. Next, we write the estimation 
error x n+ \ — x* in a form that is amenable for applying the result from [9Q, as follows: 

x n+1 -x* = (I- n~ a T n )(x n - x*) + n~^/ 2 ^ n V n + n^/ 2 T(H n )- l T n , (40) 

where T n = a 0 T(F n )- 1 V 2 /(x ri ), = -a 0 F(H n )-\ V n = nT!{Vffx n ) - E(V/(x n ) | x n )) and 

T n = —(i(ynJ^ 2 3 n . The above recursion is similar to that for 2SPSA of [30], except that we estimate the 
Hessian and gradients using RDSA and not SPSA. 

For establishing the main claim, one needs to verify conditions (2.2.1) to (2.2.3) in Theorem 2.2 of [9]. 
This can be done as follows: 


• From the results in Theorems [5] and (6[ we know that x n and V 2 f(x n ) converge to x* and V 2 /(x*), 
respectively. Thus, T n — > ao, 4> n — > — aoV 2 /(x*) -1 . Moreover, T n is identical to that in 1RDSA 
and hence, T n — »0if7>a/6 and if 7 = q/G, then the limit of T n is the vector T as defined in 
Theorem [3j These observations together imply that condition (2.2.1) of [@] is satisfied. 


V n is also identical to that in 1RDSA and hence, E(V n Vf \ x n ) —> |4 0 1 o 2 I. This implies condition 
(2.2.2) of |@] is satisfied. 


Condition (2.2.3) can be verified using arguments that are the same as those in [29] for first-order 
SPSA. 


Now, applying Theorem 2.2 of |{9], it is straightforward to obtain the expressions for the mean // and covari¬ 
ance matrix T of the limiting Gaussian distribution. □ 
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C Additional Numerical Results 


Tables l6al and l6blpresent the results from a sensitivity study conducted for the asymmetric Bernoulli variants 
of 1RDSA and 2RDSA, respectively. 


Table 6: NMSE of asymmetric Bernoulli perturbations based RDSA schemes as a function of distribution 
parameter e, with cr = 0.001 and using 2000 function measurements: 
standard error from 1000 replications shown after ± 


e value 

NMSE 

0.000001 

3.38 x 10" 2 ±4.87 x 10 -4 

0.00001 

3.38 x 10 -2 ±4.87 x 10" 4 

0.0001 

3.38 x 10 -2 ±4.87 x 10" 4 

0.001 

3.38 x 10" 2 ±4.87 x 10 -4 

0.01 

3.39 x 10 -2 ±4.87 x 10 -4 

0.1 

3.38 x 10~ 2 ±4.81 x 10 -4 

0.2 

3.37 x 10“ 2 ±4.87 x 10" 4 

0.5 

3.42 x 10~ 2 ±5.00 x 10" 4 

1 

3.54 x 10“ 2 ±5.09 x 10 -4 

2 

3.87 x 10“ 2 ± 5.76 x 10" 4 

5 

5.21 x 10 -2 ±8.10 x 10 -4 


e value 

NMSE 

0.000001 

7.21 x 10 -2 ±8.95 x 10“ 4 

0.00001 

7.93 x 10" 2 ± 1.46 x 10" 3 

0.0001 

6.24 x 10 -2 ± 1.34 x 10 -3 

0.001 

8.39 x 10 -2 ± 2.36 x 10“ 3 

0.01 

8.32 x 10 -2 ±4.04 x 10“ 2 

0.1 

2.35 x 10 _1 ± 1.13 x 10 -2 

0.2 

1.02 x 10" 1 ±9.10 x 10" 3 

0.5 

1.45 x 10 -4 ± 1.43 x 10 -4 

1 

2.24 x 10 -6 ±3.67 x 10“ 8 

2 

2.24 x 10 -6 ±3.35 x 10“ 8 

5 

2.85 x 10“ 6 ±4.60 x 10 -8 


(a) lRDSA-AsymBer (b) 2RDSA-AsymBer 
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